Day 12 - Introduction to Delta Live Tables (DLT)

2023 iThome Ironman Contest

DAY 12

AI & Data

Learning ML/LLM Development with Databricks series, part 12

15th Ironman Contest

jimmyliao

2023-09-27 17:33:41

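Delta Live Tables (DLT) is a declarative framework for building data processing pipelines on top of Delta Lake. You declare the transformations in SQL or Python, and Delta Live Tables manages the processing flow for you, including task orchestration, cluster management, monitoring, data quality, and error handling.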

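Previously, a data pipeline was defined as a series of separate Apache Spark tasks. With Delta Live Tables you instead define streaming tables and materialized views, and the system keeps them up to date. Delta Live Tables manages how data is transformed based on the query you define for each processing step. You can also enforce data quality with Delta Live Tables expectations, which let you define the data quality you expect and specify how to handle records that fail those expectations.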
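To make the idea of expectations concrete, here is a minimal plain-Python sketch of the three ways a failing record can be handled: keep it and only count the violation, drop it, or fail the update. This is a conceptual model, not Delta Live Tables code. The `Expectation` class, `apply_expectation` function, and sample rows are all hypothetical names made up for illustration. In the real Python API the same three policies are, as far as I know, expressed with the `dlt.expect`, `dlt.expect_or_drop`, and `dlt.expect_or_fail` decorators.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Record = Dict[str, object]


@dataclass
class Expectation:
    """Hypothetical stand-in for a data quality rule (not a DLT class)."""
    name: str
    predicate: Callable[[Record], bool]  # returns True when the record is valid
    on_violation: str = "warn"           # "warn", "drop", or "fail"


def apply_expectation(records: List[Record], exp: Expectation) -> Tuple[List[Record], int]:
    """Return the records passed downstream and the number of violations."""
    if exp.on_violation not in {"warn", "drop", "fail"}:
        raise ValueError(f"unknown policy: {exp.on_violation!r}")
    kept: List[Record] = []
    violations = 0
    for rec in records:
        if exp.predicate(rec):
            kept.append(rec)
            continue
        violations += 1
        if exp.on_violation == "fail":
            # Stop the whole update on the first invalid record.
            raise ValueError(f"expectation {exp.name!r} failed for {rec}")
        if exp.on_violation == "warn":
            kept.append(rec)  # keep the invalid record, only count it
        # "drop": the invalid record is left out of the result
    return kept, violations


# Made-up sample data: the third row has no id.
rows = [{"id": 1, "amount": 10}, {"id": 2, "amount": -5}, {"id": None, "amount": 3}]
valid_id = Expectation("valid_id", lambda r: r["id"] is not None, "drop")
clean_rows, violation_count = apply_expectation(rows, valid_id)
```

Switching `on_violation` to `"warn"` would pass all three rows through while still reporting one violation, which is the trade-off to think about when choosing a policy for each rule.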

What are the Delta Live Tables datasets?

Streaming tables
Materialized views
Views

What is a Delta Live Tables pipeline?

A pipeline contains materialized views and streaming tables, declared in Python or SQL.

Delta Live Tables automatically infers the dependencies between these tables and makes sure updates run in the correct order.
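Ordering updates by dependency is essentially a topological sort. The sketch below shows the idea with Python's standard `graphlib` module (Python 3.9+). The dataset names are hypothetical, and this only illustrates the ordering problem; it is not how Delta Live Tables is implemented internally.

```python
from graphlib import TopologicalSorter

# Hypothetical datasets: each key maps to the set of datasets it reads from.
dependencies = {
    "raw_orders": set(),
    "clean_orders": {"raw_orders"},
    "daily_revenue": {"clean_orders"},
    "customer_summary": {"clean_orders"},
}

# static_order() yields every dataset after all of the datasets it depends on.
update_order = list(TopologicalSorter(dependencies).static_order())
print(update_order)
```

Any order in which `raw_orders` precedes `clean_orders`, and `clean_orders` precedes both downstream datasets, is valid. The two downstream datasets do not depend on each other, so their relative order is not fixed.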

For each dataset, Delta Live Tables compares the current state with the desired state and proceeds to create or update datasets using efficient processing methods.

Delta Live Tables pipeline settings fall into two parts:

Settings that define a collection of notebooks or files that use Delta Live Tables syntax to declare datasets.
Settings that control the pipeline infrastructure, how updates are processed, and how tables are saved in the workspace.
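The two parts above can be sketched as a settings document. The example below builds one as a Python dictionary and serializes it to JSON. The field names (`libraries`, `clusters`, `continuous`, `development`, `storage`, `target`) follow the pipeline settings JSON as I understand it, so treat them as assumptions and check them against the Databricks documentation. Every value (pipeline name, notebook path, storage path, schema name) is made up for illustration.

```python
import json

# Hypothetical values throughout; only the overall shape is the point.
pipeline_settings = {
    # Part 1: the source code that declares the datasets.
    "name": "orders_pipeline",
    "libraries": [
        {"notebook": {"path": "/Repos/demo/dlt/orders_pipeline"}},
    ],
    # Part 2: infrastructure, update behaviour, and where tables are saved.
    "clusters": [
        {"label": "default", "autoscale": {"min_workers": 1, "max_workers": 3}},
    ],
    "continuous": False,   # triggered updates rather than a continuous run
    "development": True,   # run in development mode
    "storage": "/pipelines/orders_pipeline",
    "target": "orders_db",
}

settings_json = json.dumps(pipeline_settings, indent=2)
print(settings_json)
```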

Ingest data with Delta Live Tables

There is not much to highlight here. In short, Delta Live Tables claims to support many source formats.

Monitor and enforce data quality

See: https://docs.databricks.com/en/delta-live-tables/expectations.html

How Delta Live Tables relates to Delta Lake

Delta Live Tables extends Delta Lake. Because every table that Delta Live Tables creates and manages is a Delta table, these tables have the same guarantees and features that Delta Lake provides. On top of that, Delta Live Tables adds several table properties of its own.

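Limitations

All tables created and updated by Delta Live Tables are Delta tables.
A Delta Live Tables table can be defined only once, that is, in a single pipeline.
Identity columns are not supported on tables that are the target of APPLY CHANGES INTO, and they may be recomputed when materialized views are updated. For this reason, the Databricks documentation recommends using identity columns only with streaming tables.
A Databricks workspace is limited to 100 concurrent pipeline updates.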

Reference: